
Usage Background

1. Project Background

As described on the PingCAP official website, the main use cases of TiCDC are "database disaster recovery" and "data integration". Enthusiasts familiar with the TiDB ecosystem might know about "TiDB Binlog", a tool with similar functionality to TiCDC. So why reinvent the wheel?
The answer lies in the limitations of "TiDB Binlog," which cannot address issues such as the following (not an exhaustive list). For those unfamiliar with "TiDB Binlog," refer to the TiDB Binlog Overview:

  1. Poor scalability of "TiDB Binlog": Drainer is both a "single point" and a "performance" bottleneck, and splitting Drainer across multiple instances cannot guarantee "row change order."
  2. Low performance of "TiDB Binlog": it only supports writing to a single partition of a single Kafka topic, which limits Kafka throughput.
  3. Poor generality of "TiDB Binlog": writing binlog files requires Reparo to parse them back into SQL statements, and general protocols such as Maxwell and Canal are not supported.
  4. Synchronization pain points of "TiDB Binlog": it is constrained by the gRPC max message size, e.g., "Pump's gRPC message exceeds limit."
  5. Poor availability of "TiDB Binlog": when the single-point Drainer fails, synchronization is interrupted with no self-healing capability.

Thus, TiCDC was born. By capturing TiKV Change Logs directly, TiCDC schedules synchronization at table granularity across multiple Captures, solving the scalability issue. It supports writing to multiple Kafka partitions and general protocols such as Maxwell and Canal, addressing synchronization performance and ecosystem compatibility. When synchronization times out, tables become unevenly distributed across Captures, or a Capture crashes, TiCDC's scheduling mechanism and "At Least Once" guarantee ensure data integrity and self-healing.

01BinlogAndTiCDC
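
Once a cluster is running, this table-level scheduling can be observed from the command line. A minimal sketch, assuming the test environment deployed in section 4.1 (PD at http://172.16.6.155:2379): `capture list` shows the Capture nodes, and `processor list` shows which changefeeds each Capture is currently processing.

shell
# List the Capture nodes registered with the cluster
[tidb@Linux-Hostname ~]$ tiup cdc cli capture list --pd=http://172.16.6.155:2379
# List the processors, i.e. which changefeed each Capture is handling
[tidb@Linux-Hostname ~]$ tiup cdc cli processor list --pd=http://172.16.6.155:2379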

2. Tool Position

To understand how to use TiCDC, one must first clarify its position and role in the TiDB ecosystem.
1. In terms of role: a database that only lets data in but not out is a "one-way street" that will inevitably be abandoned by the market as a so-called "data island." TiCDC balances synchronization performance, data integrity, consistency, and ecosystem compatibility, closing the loop of "data inflow -> TiDB -> data outflow." Additionally, discussing performance without considering scenarios is meaningless: no product excels in every scenario, and TiDB has its own comfort zones and pain points. With TiCDC, users can easily move data around and use TiDB in the scenarios where it performs best.

2. In terms of position: TiCDC sits between TiDB and downstream destinations, which include all products and platforms compatible with the MySQL protocol (e.g., TiDB, MySQL, RDS, third-party cloud platforms). Users can also build custom consumers on top of the general message queue output protocols. Currently, TiCDC supports protocols such as Open Protocol, Canal-JSON, Canal, Avro, and Maxwell; some are only partially supported. For details, refer to the TiCDC Supported Protocols Guide.
Additionally, TiCDC covers almost all mainstream "data synchronization" use cases. The diagram below illustrates the classic combination of "MySQL --> TiDB --> Others." For databases using other protocols (Oracle, PostgreSQL), flowing data into TiDB is equivalent to flowing it into MySQL, since TiDB is compatible with the MySQL protocol, so mature solutions already exist.
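
The output protocol is chosen per changefeed through the `protocol` query parameter of the sink URI. A minimal sketch with placeholder hosts and topic names (the full `tiup cdc cli changefeed create` command appears in section 4.2):

shell
# Canal-JSON output to Kafka (placeholder host and topic)
--sink-uri="kafka://<kafka-host>:9092/<topic>?protocol=canal-json"
# Maxwell output to Kafka (placeholder host and topic)
--sink-uri="kafka://<kafka-host>:9092/<topic>?protocol=maxwell"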

02UpstreamAndDownstream

3. Usage Scenarios

Note: The following companies were identified through articles shared on AskTUG that mention their use of TiCDC; they are only the ones discoverable through search. Many commercial customers either request confidentiality or have not yet disclosed their usage.

3.1 Xiaohongshu

From Zhang Yihao's sharing in 【PingCAP Infra Meetup】No.131 TiCDC's Ecosystem and Community Development, Xiaohongshu uses TiCDC for content review, note tag recommendations, and growth audits. The architecture is as follows: "TiDB --> TiCDC --> Kafka --> Flink --> TiDB," enabling aggregation with other data sources.

03xiaohongshu_share

3.2 Haier Smart Home

From Yao Xiang's sharing in 【PingCAP Infra Meetup】No.131 TiCDC's Ecosystem and Community Development, Haier Smart Home uses TiCDC for search and recommendations. User data and smart home information are processed through "TiDB --> TiCDC --> Kafka --> ES," achieving near real-time search, with Kafka consuming approximately 3 million messages daily. As of the sharing on 2019-09-19, TiCDC's synchronization latency is in the "millisecond" range under normal conditions and within "seconds" (less than 10 s) under jitter.
Additionally, as seen in GitHub -- Question and Bug Reports, TiCDC has a roadmap for improving synchronization performance, covering issues such as "mqSink flushes data synchronously, causing low performance" and "improve performance of syncer." TiCDC's synchronization performance is under continuous optimization.

04haierzhijia_share

3.3 360

From Dai Xiaolei's sharing in BLOG - TiCDC Application Scenarios Analysis, 360 uses TiCDC to meet the needs of incremental data extraction, same-city dual-cluster hot backup, and stream processing.

4. Usage Methods

4.1 Deploying TiCDC

The following test environment is used to deploy TiCDC and explore its synchronization capabilities. The CDC components on 172.16.6.155:8300 and 172.16.6.196:8300 shown below were added through the scale-out step described afterward.

shell
[tidb@Linux-Hostname ~]$ tiup cluster display tidb-test

ID                  Role          Host          Ports        OS/Arch       Status  Data Dir                                     Deploy Dir
--                  ----          ----          -----        -------       ------  --------                                     ----------
172.16.6.155:9093   alertmanager  172.16.6.155  9093/9094    linux/x86_64  Up      /home/data/tidb-home/data/alertmanager-9093  /home/data/tidb-deploy/alertmanager-9093
172.16.6.155:8300   cdc           172.16.6.155  8300         linux/x86_64  Up      /data/deploy/install/data/cdc-8300           /home/data/tidb-deploy/cdc-8300
172.16.6.196:8300   cdc           172.16.6.196  8300         linux/x86_64  Up      /data/deploy/install/data/cdc-8300           /home/data/tidb-deploy/cdc-8300
172.16.6.155:3000   grafana       172.16.6.155  3000         linux/x86_64  Up      -                                            /home/data/tidb-deploy/grafana-3000
172.16.6.155:2379   pd            172.16.6.155  2379/2380    linux/x86_64  Up|UI   /home/data/tidb-home/data/pd-2379            /home/data/tidb-deploy/pd-2379
172.16.6.194:2379   pd            172.16.6.194  2379/2380    linux/x86_64  Up      /home/data/tidb-home/data/pd-2379            /home/data/tidb-deploy/pd-2379
172.16.6.196:2379   pd            172.16.6.196  2379/2380    linux/x86_64  Up|L    /home/data/tidb-home/data/pd-2379            /home/data/tidb-deploy/pd-2379
172.16.6.155:9090   prometheus    172.16.6.155  9090/12020   linux/x86_64  Up      /home/data/tidb-home/data/prometheus-9090    /home/data/tidb-deploy/prometheus-9090
172.16.6.155:4000   tidb          172.16.6.155  4000/10080   linux/x86_64  Up      -                                            /home/data/tidb-deploy/tidb-4000
172.16.6.196:4000   tidb          172.16.6.196  4000/10080   linux/x86_64  Up      -                                            /home/data/tidb-deploy/tidb-4000
172.16.6.155:20160  tikv          172.16.6.155  20160/20180  linux/x86_64  Up      /home/data/tidb-home/data/tikv-20160         /home/data/tidb-deploy/tikv-20160
172.16.6.194:20160  tikv          172.16.6.194  20160/20180  linux/x86_64  Up      /home/data/tidb-home/data/tikv-20160         /home/data/tidb-deploy/tikv-20160
172.16.6.196:20160  tikv          172.16.6.196  20160/20180  linux/x86_64  Up      /home/data/tidb-home/data/tikv-20160         /home/data/tidb-deploy/tikv-20160

TiCDC reached GA in version 4.0.6, and version 4.0.6 or later is recommended. For details, refer to the TiCDC Official Documentation. The scale-out steps are as follows.

shell
[tidb@Linux-Hostname ~]$ cat scale-out-cdc.yaml 
cdc_servers:
  - host: 172.16.6.155
    gc-ttl: 86400
    data_dir: /data/deploy/install/data/cdc-8300
  - host: 172.16.6.196
    gc-ttl: 86400
    data_dir: /data/deploy/install/data/cdc-8300

[tidb@Linux-Hostname ~]$ tiup cluster scale-out tidb-test scale-out-cdc.yaml
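
After the scale-out completes, the health of each CDC node can be checked through its HTTP status endpoint. A minimal check, assuming the default port 8300 from the topology above; the endpoint returns basic information such as the version and process ID.

shell
[tidb@Linux-Hostname ~]$ curl http://172.16.6.155:8300/status
[tidb@Linux-Hostname ~]$ curl http://172.16.6.196:8300/status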

4.2 Synchronization to Kafka

As I had no prior experience with Kafka, I followed the Kafka Quickstart to set up a Kafka environment on 172.16.6.155 and 172.16.6.196 for testing. The goal was to parse the messages that TiCDC sends to Kafka.

shell
# Install Java Runtime Environment
[tidb@kafka1 ~]$ yum install java-1.8.0-openjdk.x86_64
[tidb@kafka1 ~]$ java -version
openjdk version "1.8.0_322"

# Set up Kafka on 172.16.6.155 and 172.16.6.196
[tidb@kafka1 ~]$  wget https://dlcdn.apache.org/kafka/3.1.0/kafka_2.13-3.1.0.tgz --no-check-certificate
[tidb@kafka1 ~]$  tar -xzf kafka_2.13-3.1.0.tgz
[tidb@kafka1 ~]$  cd kafka_2.13-3.1.0
[tidb@kafka1 ~]$  bin/zookeeper-server-start.sh config/zookeeper.properties
[tidb@kafka1 ~]$  bin/kafka-server-start.sh config/server.properties
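# Note: ZooKeeper and the Kafka broker above both run in the foreground,
# so start each in its own terminal (or pass -daemon to run them in the background)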


# Create a topic named tidb-test with 2 partitions
[tidb@kafka1 ~]$  bin/kafka-topics.sh --create --topic tidb-test \
    --bootstrap-server  172.16.6.155:9092,172.16.6.196:9092 \
    --partitions 2 
Created topic tidb-test.


# Describe the tidb-test topic to verify its configuration
[tidb@kafka1 ~]$ bin/kafka-topics.sh --describe --topic tidb-test \
    --bootstrap-server 172.16.6.155:9092,172.16.6.196:9092
Topic: tidb-test TopicId: OjM6FFrBQtqyEB4sVWmYCQ PartitionCount: 2 ReplicationFactor: 1 Configs: segment.bytes=1073741824
 Topic: tidb-test Partition: 0 Leader: 0 Replicas: 0 Isr: 0
 Topic: tidb-test Partition: 1 Leader: 0 Replicas: 0 Isr: 0

# Create a changefeed in TiCDC to synchronize data to Kafka, using the manually created tidb-test topic
[tidb@kafka1 ~]$ tiup cdc cli changefeed create --pd=http://172.16.6.155:2379 \
    --sink-uri="kafka://172.16.6.155:9092/tidb-test?protocol=canal-json&kafka-version=2.4.0&partition-num=2&max-message-bytes=67108864&replication-factor=1"
tiup is checking updates for component cdc ...
Starting component `cdc`: /home/tidb/.tiup/components/cdc/v6.0.0/cdc /home/tidb/.tiup/components/cdc/v6.0.0/cdc cli changefeed create --pd=http://172.16.6.155:2379 --sink-uri=kafka://172.16.6.155:9092/tidb-test?protocol=canal-json&kafka-version=2.4.0&partition-num=2&max-message-bytes=67108864&replication-factor=1
[2022/05/12 16:37:09.667 +08:00] [WARN] [kafka.go:416] ["topic's `max.message.bytes` less than the `max-message-bytes`,use topic's `max.message.bytes` to initialize the Kafka producer"] [max.message.bytes=1048588] [max-message-bytes=67108864]
[2022/05/12 16:37:09.667 +08:00] [WARN] [kafka.go:425] ["topic already exist, TiCDC will not create the topic"] [topic=tidb-test] [detail="{\"NumPartitions\":2,\"ReplicationFactor\":1,\"ReplicaAssignment\":{\"0\":[0],\"1\":[0]},\"ConfigEntries\":{\"segment.bytes\":\"1073741824\"}}"]
Create changefeed successfully!
......
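
# Optional extra step (not part of the original test): confirm the changefeed is in the "normal" state
[tidb@kafka1 ~]$ tiup cdc cli changefeed list --pd=http://172.16.6.155:2379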

# Use sysbench to insert 2 rows into jan.sbtest1   
[tidb@kafka1 ~]$ sysbench oltp_read_write --mysql-host=127.0.0.1 \
    --mysql-port=4000 \
    --mysql-db=jan \
    --mysql-user=root \
    --mysql-password= \
    --table_size=2 \
    --tables=1  prepare
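
# Optional extra step (not part of the original test, assumes the mysql client is installed):
# confirm the rows written by sysbench before checking Kafka
[tidb@kafka1 ~]$ mysql -h 127.0.0.1 -P 4000 -u root -e "SELECT COUNT(*) FROM jan.sbtest1"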

# Parse the data changes in Kafka as follows   
[tidb@kafka1 ~]$ bin/kafka-console-consumer.sh --topic tidb-test \
    --from-beginning --bootstrap-server 172.16.6.155:9092,172.16.6.196:9092
{"id":0,"database":"jan","table":"sbtest1","pkNames":["id"],"isDdl":false,"type":"INSERT","es":1652344970654,"ts":1652344977538,"sql":"","sqlType":{"c":1,"id":4,"k":4,"pad":1},"mysqlType":{"c":"char","id":"int","k":"int","pad":"char"},"data":[{"c":"83868641912-28773972837-60736120486-75162659906-27563526494-20381887404-41576422241-93426793964-56405065102-33518432330","id":"1","k":"1","pad":"67847967377-48000963322-62604785301-91415491898-96926520291"}],"old":null}
{"id":0,"database":"jan","table":"sbtest1","pkNames":["id"],"isDdl":false,"type":"INSERT","es":1652344970654,"ts":1652344977538,"sql":"","sqlType":{"c":1,"id":4,"k":4,"pad":1},"mysqlType":{"c":"char","id":"int","k":"int","pad":"char"},"data":[{"c":"38014276128-25250245652-62722561801-27818678124-24890218270-18312424692-92565570600-36243745486-21199862476-38576014630","id":"2","k":"2","pad":"23183251411-36241541236-31706421314-92007079971-60663066966"}],"old":null}
{"id":0,"database":"jan","table":"","pkNames":null,"isDdl":true,"type":"QUERY","es":1652344966953,"ts":1652344969305,"sql":"CREATE DATABASE `jan`","sqlType":null,"mysqlType":null,"data":null,"old":null}
{"id":0,"database":"jan","table":"sbtest1","pkNames":null,"isDdl":true,"type":"CREATE","es":1652344970604,"ts":1652344975305,"sql":"CREATE TABLE `sbtest1` (`id` INT NOT NULL AUTO_INCREMENT,`k` INT DEFAULT _UTF8MB4'0' NOT NULL,`c` CHAR(120) DEFAULT _UTF8MB4'' NOT NULL,`pad` CHAR(60) DEFAULT _UTF8MB4'' NOT NULL,PRIMARY KEY(`id`)) ENGINE = innodb","sqlType":null,"mysqlType":null,"data":null,"old":null}
{"id":0,"database":"jan","table":"sbtest1","pkNames":null,"isDdl":true,"type":"CINDEX","es":1652344973455,"ts":1652344978504,"sql":"CREATE INDEX `k_1` ON `sbtest1` (`k`)","sqlType":null,"mysqlType":null,"data":null,"old":null}

# Clean up the Kafka environment set up on 172.16.6.155 and 172.16.6.196
# First, use ctrl + c to close all foreground processes started in terminals
[tidb@kafka1 ~]$ rm -rf /tmp/kafka-logs /tmp/zookeeper

4.3 Synchronization to MySQL

To synchronize data to MySQL, install MySQL on 172.16.6.155:3306. After configuring the relevant parameters, a changefeed with the ID simple-replication-task can be created. Synchronization progress can be monitored on the Grafana dashboards. MySQL is a familiar tool for DBAs, so the steps for verifying synchronization are not elaborated here.

shell
[tidb@Linux-Hostname ~]$ tiup cdc cli changefeed create --pd=http://172.16.6.155:2379 \
  --sink-uri="mysql://jan:123123@172.16.6.155:3306/?time-zone=&worker-count=16&max-txn-row=5000" \
  --changefeed-id="simple-replication-task" \
  --sort-engine="unified" \
  --config=changefeed.toml 

tiup is checking updates for component cdc ...
Starting component `cdc`: /home/tidb/.tiup/components/cdc/v6.0.0/cdc /home/tidb/.tiup/components/cdc/v6.0.0/cdc cli changefeed create --pd=http://172.16.6.155:2379 --sink-uri=mysql://jan:123123@172.16.6.155:3306/?time-zone=&worker-count=16&max-txn-row=5000 --changefeed-id=simple-replication-task --sort-engine=unified --config=changefeed.toml
[2022/04/12 16:58:40.954 +08:00] [WARN] [mysql_params.go:143] ["max-txn-row too large"] [original=5000] [override=2048]
Create changefeed successfully!
......
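
The changefeed.toml referenced by --config above is not shown in the original steps. Below is a minimal sketch of what such a file might contain; the filter rule and settings are illustrative assumptions, not the exact file used in this test.

shell
[tidb@Linux-Hostname ~]$ cat changefeed.toml
# Whether schema/table matching is case sensitive
case-sensitive = true
# Emit the previous value of a row on UPDATE/DELETE (needed by some downstream consumers)
enable-old-value = true

[filter]
# Only replicate tables under the `jan` schema (illustrative rule)
rules = ['jan.*']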

5. Referenced Articles

1. 【PingCAP Infra Meetup】No.131 Why and how we build TiCDC
2. 【PingCAP Infra Meetup】No.131 TiCDC's Ecosystem and Community Development
3. Official Documentation -- TiCDC
4. Official Documentation -- TiDB Binlog
5. Official FAQ -- Pump's gRPC message exceeds limit
6. GitHub Design Doc -- TiCDC Supported Message Queue Output Protocols Guide
7. GitHub Question and Bug Reports -- TiCDC Performance Improvement Roadmap
8. AskTUG Blog -- Dai Xiaolei, TiCDC Application Scenarios Analysis
9. Official Documentation -- Kafka

Additionally, AskTUG is the official support community for the TiDB ecosystem; discussion of this article is welcome at https://tidb.net/blog/70588c4c.